
from jupyterquiz import display_quiz
import numpy as np
At this stage, having acquired the necessary data and a preliminary idea about the models you intend to train and implement, you might be contemplating what steps to take next. How can you ensure that your model's performance isn't merely a consequence of chance in the data selection process? How do you determine if the selected model outperforms others? Additionally, what measures should be considered if the available data is limited, potentially leading to overfitting in the models?
The answer to all these questions is simple: use cross-validation.
Cross-validation is a method for assessing the effectiveness of a machine learning model in which the data is divided into several subsets. The model is trained on one piece of data and tested on another. We repeat the process several times to ensure that the model generalizes well to different parts of the data, and not just to a specific subset. This helps to obtain a more objective assessment of the model's effectiveness.
The appropriate technique depends on the dataset: small or large, balanced or imbalanced, time series or not. Cross-validation brings several benefits:
Better Performance Evaluation: Since it gives a more precise estimation of the model's ability to generalize to unseen data compared to a single train-test split.
Hyperparameter Tuning: Cross-validation shows how the model performs across several folds for each hyperparameter setting, helping to identify the values that generalize best.
Overfitting Avoidance: Hyperparameter tuning without cross-validation might lead to overfitting to a specific train-test split. Cross-validation mitigates this risk by evaluating hyperparameters across various data subsets, ensuring better generalization.
display_quiz("#qqq1")
Idea: Randomly divide the data into training and test data, using the same split for all models. Model quality and robustness to overfitting are checked on the test data. This is a common choice and a quick validation method.
Commonly used values: 80% training and 20% test, 70% training and 30% test.
However, this approach has a major weakness: the result depends directly on which observations end up in the training group and which in the test group. The approaches that follow address this problem.
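A quick way to see this dependence is to repeat the split with different random seeds and watch the score move. A minimal sketch on synthetic data (the dataset and model here are purely illustrative, not the ones used below):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Small synthetic dataset: scores will vary noticeably between splits
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

scores = []
for seed in range(5):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=seed)
    model = LogisticRegression().fit(X_tr, y_tr)
    scores.append(model.score(X_te, y_te))

# Accuracy depends on which rows landed in the test set
print([round(s, 3) for s in scores])
```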
import warnings
warnings.filterwarnings("ignore")
We will use the Cardiovascular Heart Disease dataset. It includes information on age, gender, height, weight, blood pressure values, cholesterol levels, glucose levels, smoking habits and alcohol consumption of over 70 thousand individuals.
import pandas as pd
from sklearn.model_selection import train_test_split
df = pd.read_csv('heart_data.csv')
X = df.drop("cardio", axis=1)
y = df["cardio"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=51)
print(X_train.shape)
print(X_test.shape)
(56000, 13)
(14000, 13)
df.head()
| | index | id | age | gender | height | weight | ap_hi | ap_lo | cholesterol | gluc | smoke | alco | active | cardio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 18393 | 2 | 168 | 62.0 | 110 | 80 | 1 | 1 | 0 | 0 | 1 | 0 |
| 1 | 1 | 1 | 20228 | 1 | 156 | 85.0 | 140 | 90 | 3 | 1 | 0 | 0 | 1 | 1 |
| 2 | 2 | 2 | 18857 | 1 | 165 | 64.0 | 130 | 70 | 3 | 1 | 0 | 0 | 0 | 1 |
| 3 | 3 | 3 | 17623 | 2 | 169 | 82.0 | 150 | 100 | 1 | 1 | 0 | 0 | 1 | 1 |
| 4 | 4 | 4 | 17474 | 1 | 156 | 56.0 | 100 | 60 | 1 | 1 | 0 | 0 | 0 | 0 |
As you can see below, the class distribution is approximately even, so this dataset can be considered balanced.
y.value_counts()
0    35021
1    34979
Name: cardio, dtype: int64
The main idea of this approach is to split the whole dataset in $K$ parts of equal size and each partition is called a fold.
One fold is used for validation and the other $K-1$ folds are used for training the model. The procedure is repeated $K$ times so that each fold serves as the validation set exactly once. As a result, every observation is used both for training and for testing.
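The bookkeeping behind this idea can be sketched with plain NumPy (a toy illustration of the index splitting, not scikit-learn's implementation):

```python
import numpy as np

n_samples, k = 10, 5
indices = np.arange(n_samples)
folds = np.array_split(indices, k)  # K parts of (nearly) equal size

for i, val_idx in enumerate(folds):
    # All folds except the i-th form the training set
    train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
    print(f"Fold {i + 1}: train={train_idx.tolist()}, validation={val_idx.tolist()}")

# Every observation appears in exactly one validation fold
assert sorted(np.concatenate(folds).tolist()) == indices.tolist()
```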

Standard values:
Most commonly, the number of folds used is 5 or 10.
This validation technique is not considered suitable for imbalanced datasets, because the folds may not preserve the ratio of each class, so the model will not be trained on a representative sample of every class. This issue can be resolved using the following method.
from sklearn.model_selection import KFold, StratifiedKFold, cross_val_score
kf = KFold(n_splits=5, shuffle=True, random_state=42)

count = 1
for train_index, test_index in kf.split(X, y):
    print(f'Fold:{count}, Train set: {len(train_index)}, Test set:{len(test_index)}')
    count += 1
Fold:1, Train set: 56000, Test set:14000
Fold:2, Train set: 56000, Test set:14000
Fold:3, Train set: 56000, Test set:14000
Fold:4, Train set: 56000, Test set:14000
Fold:5, Train set: 56000, Test set:14000
Now we will apply this method on different models, evaluate accuracy for every fold and output the mean.
from sklearn.linear_model import LogisticRegression
score = cross_val_score(LogisticRegression(random_state=42), X, y, cv=kf, scoring="accuracy")
print(f'Scores for each fold are: {score}')
print(f'Average score: {"{:.2f}".format(score.mean())}')
Scores for each fold are: [0.69964286 0.69935714 0.69678571 0.69014286 0.698     ]
Average score: 0.70
from sklearn.ensemble import RandomForestClassifier
score = cross_val_score(RandomForestClassifier(random_state=42), X, y, cv=kf, scoring="accuracy")
print(f'Scores for each fold are: {score}')
print(f'Average score: {"{:.2f}".format(score.mean())}')
Scores for each fold are: [0.72757143 0.72835714 0.72542857 0.72807143 0.72271429]
Average score: 0.73
from sklearn.ensemble import GradientBoostingClassifier
score = cross_val_score(GradientBoostingClassifier(random_state=42), X, y, cv=kf, scoring="accuracy")
print(f'Scores for each fold are: {score}')
print(f'Average score: {"{:.2f}".format(score.mean())}')
Scores for each fold are: [0.73685714 0.73642857 0.73307143 0.73392857 0.73571429]
Average score: 0.74
display_quiz("#qqq2")
display_quiz("#qqq3")
Applying different solvers in Logistic Regression.
algorithms = ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga']
for algo in algorithms:
    score = cross_val_score(LogisticRegression(max_iter=500, solver=algo, random_state=42), X, y, cv=kf, scoring="accuracy")
    print(f'Average score({algo}): {"{:.3f}".format(score.mean())}')
Average score(newton-cg): 0.721
Average score(lbfgs): 0.697
Average score(liblinear): 0.707
Average score(sag): 0.668
Average score(saga): 0.647
Trying different values of the maximum number of leaf nodes in Random Forest.
max_leaf_nodes = [None, 5, 10, 15, 20]
for val in max_leaf_nodes:
    score = cross_val_score(RandomForestClassifier(max_leaf_nodes=val, random_state=42), X, y, cv=kf, scoring="accuracy")
    print(f'Average score({val}): {"{:.3f}".format(score.mean())}')
Average score(None): 0.726
Average score(5): 0.723
Average score(10): 0.726
Average score(15): 0.727
Average score(20): 0.728
We can also iterate over several parameters at once with GridSearchCV.
from sklearn.model_selection import GridSearchCV
params = {
    'n_estimators': [50, 100],
    'max_depth': [3, 5, 7],
}
grid_search = GridSearchCV(GradientBoostingClassifier(random_state=42), params, cv=kf, scoring='accuracy')
grid_search.fit(X_train, y_train)
best_params = grid_search.best_params_
print(best_params)
{'max_depth': 3, 'n_estimators': 100}
best_model = grid_search.best_estimator_
predictions = best_model.predict(X_test)
print(predictions)
[1 1 1 ... 1 1 0]
The approach is named after the term stratum: subjects are divided into subgroups called strata based on characteristics they share (e.g., race, gender, educational attainment), and each subgroup is then sampled randomly.
This is an enhanced version of the k-fold cross-validation technique. Although it too splits the dataset into k equal folds, each fold has the same ratio of instances of target variables that are in the complete dataset, helping to generalize each fold.
This makes it well suited for imbalanced datasets, but not for time-series data.
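We can check the fold-wise class ratio directly. A small sketch on a toy imbalanced target (the 80/20 split here is illustrative, not the heart dataset):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy imbalanced target: 80% zeros, 20% ones
y = np.array([0] * 80 + [1] * 20)
X = np.zeros((100, 1))  # features are irrelevant to the split itself

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
ratios = []
for fold, (train_idx, val_idx) in enumerate(skf.split(X, y), start=1):
    ratios.append(y[val_idx].mean())
    # Each validation fold keeps the overall 0.20 positive ratio
    print(f"Fold {fold}: positive ratio in validation = {ratios[-1]:.2f}")
```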

import matplotlib.pyplot as plt
kf = KFold(n_splits=5, shuffle=True, random_state=42)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
models = [
    ('Logistic Regression', LogisticRegression()),
    ('Gradient Boosting', GradientBoostingClassifier()),
    ('Random Forest', RandomForestClassifier())
]
kfold_scores = []
stratified_kfold_scores = []
for name, model in models:
    kfold_scores.append(cross_val_score(model, X_train, y_train, cv=kf, scoring='accuracy').mean())
    stratified_kfold_scores.append(cross_val_score(model, X_train, y_train, cv=skf, scoring='accuracy').mean())
fig, ax = plt.subplots(figsize=(10, 6))
bar_width = 0.4
bar_positions_kfold = np.arange(len(models))
bar_positions_stratified_kfold = bar_positions_kfold + bar_width
ax.bar(bar_positions_kfold, kfold_scores, bar_width, label='K-Fold')
ax.bar(bar_positions_stratified_kfold, stratified_kfold_scores, bar_width, label='Stratified K-Fold')
ax.set_xticks(bar_positions_kfold + bar_width / 2)
ax.set_xticklabels([model[0] for model in models])
ax.set_xlabel('Models')
ax.set_ylabel('Mean Accuracy')
ax.set_title('Mean Accuracy of Models under Different Cross-Validation Techniques')
ax.legend()
def autolabel(bars):
    for bar in bars:
        height = bar.get_height()
        ax.annotate(f'{height:.2%}',
                    xy=(bar.get_x() + bar.get_width() / 2, height),
                    xytext=(0, 3),
                    textcoords="offset points",
                    ha='center', va='bottom')
autolabel(ax.patches)
plt.show()
display_quiz("#qqq5")
This approach follows the same idea as K-Fold; in fact, the LOOCV technique is identical to K-Fold with $K = N$ (the size of the whole dataset). Note the intuitive difference between K-Fold and the Leave-One-Out / Leave-P-Out methods: in K-Fold we specify the number of groups and the group size follows from the data size, whereas in the leave-out methods we specify the size of the validation set itself and the number of groups follows from the data size.
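This equivalence is easy to confirm with scikit-learn's `LeaveOneOut` splitter (the 10-observation array below is a made-up example):

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut

X = np.arange(20).reshape(10, 2)  # 10 observations
y = np.array([0, 1] * 5)

loo = LeaveOneOut()
n_loo = loo.get_n_splits(X)  # one split per observation

# Equivalent to K-Fold with K = N (number of samples)
kf = KFold(n_splits=len(X))
n_kf = kf.get_n_splits(X)

print(n_loo, n_kf)
```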

display_quiz("#qqq4")
Q. Why is it important to use cross-validation with Gradient Boosting?
{admonition} Answer
:class: tip, dropdown
Due to its algorithm, gradient boosting tends to overfit rapidly: at each iteration the model learns from the errors of the previous iterations, reducing the training error step by step until a stopping criterion is met. Cross-validation should be applied in order to choose the best stopping criterion.
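One practical way to pick the stopping point is scikit-learn's built-in early stopping (`n_iter_no_change` with `validation_fraction`), sketched here on synthetic data rather than the heart dataset:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X_toy, y_toy = make_classification(n_samples=1000, n_features=10, random_state=42)

# Hold out 10% of the training data internally and stop boosting when the
# validation score has not improved for 10 consecutive rounds
gb = GradientBoostingClassifier(
    n_estimators=500,
    n_iter_no_change=10,
    validation_fraction=0.1,
    random_state=42,
)
gb.fit(X_toy, y_toy)
# n_estimators_ is the number of boosting rounds actually used,
# typically far fewer than the 500 allowed
print(gb.n_estimators_)
```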
import numpy as np
import plotly.graph_objs as go
from sklearn.model_selection import learning_curve
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
models = {
'Logistic Regression': LogisticRegression(),
'Gradient Boosting': GradientBoostingClassifier(),
'Random Forest': RandomForestClassifier()
}
def plot_learning_curves(models, X, y, cv=None, n_jobs=-1, train_sizes=np.linspace(.1, 1.0, 5)):
    colors = ['blue', 'green', 'red']
    data = []
    color_index = 0
    for name, model in models.items():
        # Assign to a new name so the fractional train_sizes argument is not
        # overwritten with absolute sizes between models
        sizes, train_scores, test_scores = learning_curve(
            model, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes)
        train_scores_mean = np.mean(train_scores, axis=1)
        train_scores_std = np.std(train_scores, axis=1)
        test_scores_mean = np.mean(test_scores, axis=1)
        test_scores_std = np.std(test_scores, axis=1)
        trace1 = go.Scatter(
            x=sizes, y=train_scores_mean,
            mode='lines+markers',
            name=f"{name} (Training score)",
            line=dict(color=colors[color_index])
        )
        trace2 = go.Scatter(
            x=sizes, y=test_scores_mean,
            mode='lines+markers',
            name=f"{name} (Cross-validation score)",
            line=dict(color=colors[color_index])
        )
        color_index += 1
        trace3 = go.Scatter(
            x=np.concatenate([sizes, sizes[::-1]]),
            y=np.concatenate([train_scores_mean - train_scores_std,
                              (train_scores_mean + train_scores_std)[::-1]]),
            fill='tozerox',
            fillcolor='rgba(0,100,80,0.2)',
            line=dict(color='rgba(255,255,255,0)'),
            showlegend=False
        )
        trace4 = go.Scatter(
            x=np.concatenate([sizes, sizes[::-1]]),
            y=np.concatenate([test_scores_mean - test_scores_std,
                              (test_scores_mean + test_scores_std)[::-1]]),
            fill='tozerox',
            fillcolor='rgba(255,140,0,0.2)',
            line=dict(color='rgba(255,255,255,0)'),
            showlegend=False
        )
        data.extend([trace1, trace2, trace3, trace4])
    layout = go.Layout(
        title='Learning Curves for Different Models',
        xaxis=dict(title='Training examples'),
        yaxis=dict(title='Score'),
        legend=dict(x=0.7, y=1.1)
    )
    fig = go.Figure(data=data, layout=layout)
    fig.show(renderer='notebook')
plot_learning_curves(models, X_train, y_train, cv=5)
In this graph you can see how the performance of each model changes with the amount of training data and how the training and cross-validation scores converge.
| Approach | Execution speed | Efficient with small datasets | Efficient with large datasets |
|---|---|---|---|
| Train Test split | ✔ | ✕ | ✔ |
| K-Fold | ✕ | ✔ | ✔ |
| LooCV | ✕ | ✔ | ✕ |